andrewjstone
Contributor

This builds on #8741

An alarm represents a protocol invariant violation. It's unclear exactly what should be done about these other than recording them and allowing them to be reported upstack, which is what is done in this PR. An argument could be made for "freezing" the state machine such that trust quorum nodes stop working and the only thing they can do is report alarm status. However, that would block the trust quorum from operating at all, and it's unclear if this should cause an outage on that node.

I'm also somewhat hesitant to put the alarms into the persistent state as that would prevent unlock in the case of a sled/rack reboot.

The flip side of merely recording alarms is the danger of continuing to operate with an invariant violation. This could be risky, and since we should never see these, pausing for a support call may be the right thing. TBD, once more work is done on the protocol.
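
A minimal sketch of the "record and report" approach described above, not the code in this PR. All type, variant, and method names here (`Alarm`, `ConflictingConfigurations`, `PersistentState`, `Node`, `reboot`) are hypothetical and chosen for illustration. It assumes alarms live only in memory, so a reboot clears them and unlock is not blocked.

```rust
use std::collections::BTreeSet;

/// Hypothetical alarm type: each variant names one invariant violation.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Alarm {
    /// Two different configurations were observed for the same epoch.
    ConflictingConfigurations { epoch: u64 },
    /// A key share could not be computed for a committed configuration.
    ShareComputationFailed { epoch: u64 },
}

/// Hypothetical state that would be written to disk.
/// Alarms are deliberately not part of it.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct PersistentState {
    pub latest_committed_epoch: Option<u64>,
}

/// Hypothetical in-memory node state.
#[derive(Debug, Default)]
pub struct Node {
    persistent: PersistentState,
    alarms: BTreeSet<Alarm>,
}

impl Node {
    /// Record the alarm and keep operating; the set deduplicates repeats.
    pub fn raise_alarm(&mut self, alarm: Alarm) {
        self.alarms.insert(alarm);
    }

    /// Expose recorded alarms so they can be reported upstack.
    pub fn alarms(&self) -> &BTreeSet<Alarm> {
        &self.alarms
    }

    /// The node is not frozen by an alarm, so commits still proceed.
    pub fn commit(&mut self, epoch: u64) {
        self.persistent.latest_committed_epoch = Some(epoch);
    }

    pub fn latest_committed_epoch(&self) -> Option<u64> {
        self.persistent.latest_committed_epoch
    }

    /// Simulate a reboot: only persistent state survives.
    pub fn reboot(self) -> Node {
        Node { persistent: self.persistent, alarms: BTreeSet::new() }
    }
}
```

Freezing the state machine instead would amount to having `commit` refuse to run while `alarms` is non-empty; that tradeoff is the open question above.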

@andrewjstone andrewjstone force-pushed the tq-alarms branch 3 times, most recently from 8c5b6bd to ad388eb Compare August 2, 2025 23:39
@andrewjstone andrewjstone mentioned this pull request Aug 8, 2025
Base automatically changed from tq-reconfigure to main August 27, 2025 19:35
It's not actually an error to receive a `CommitAdvance` while
coordinating for the same epoch. The `GetShare` from the coordinator
could have been delayed in the network, and the node that received it
may have already committed before the coordinator knew it was done
preparing. In essence, the following would happen:

1. The coordinator would send `GetShare` requests for the prior epoch.
2. Enough nodes would reply so that the coordinator would start sending
prepares.
3. Enough nodes would ack prepares to commit.
4. Nexus would poll and send commits. Other nodes would get those
commits, but not the coordinator.
5. A node that hadn't yet received the `GetShare` would get
a `CommitAdvance` or see the `Commit` from Nexus, get its
configuration, recompute its own share, and commit. It may have been
a prior coordinator with delayed deliveries to other nodes of `GetShare`
messages.
6. The node that just committed finally receives the `GetShare` and
sends back a `CommitAdvance` to the coordinator.

This is all valid, and was similar to a proptest counterexample.
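
A rough sketch of the coordinator-side handling implied by the steps above, under simplifying assumptions. The `Coordinator`, `Configuration`, and `Outcome` types and the `on_commit_advance` method are hypothetical stand-ins, not the actual `trust-quorum` API. The sketch only models the same-epoch case and ignores every other epoch.

```rust
/// Hypothetical, simplified configuration: only the epoch matters here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Configuration {
    pub epoch: u64,
}

/// Hypothetical result of handling a `CommitAdvance`.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The epoch being coordinated is already committed elsewhere.
    Committed,
    /// Not the same-epoch case; this sketch does not model it.
    Ignored,
}

/// Hypothetical coordinator state for a single epoch.
#[derive(Debug)]
pub struct Coordinator {
    epoch: u64,
    coordinating: bool,
    committed: Option<Configuration>,
}

impl Coordinator {
    pub fn new(epoch: u64) -> Self {
        Self { epoch, coordinating: true, committed: None }
    }

    pub fn is_coordinating(&self) -> bool {
        self.coordinating
    }

    pub fn committed_epoch(&self) -> Option<u64> {
        self.committed.as_ref().map(|c| c.epoch)
    }

    /// A `CommitAdvance` for the epoch being coordinated is valid:
    /// a peer committed before its reply to `GetShare` reached us.
    /// No alarm is recorded; the coordinator stops and adopts the commit.
    pub fn on_commit_advance(&mut self, config: Configuration) -> Outcome {
        if self.coordinating && config.epoch == self.epoch {
            self.coordinating = false;
            self.committed = Some(config);
            Outcome::Committed
        } else {
            Outcome::Ignored
        }
    }
}
```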
@andrewjstone andrewjstone enabled auto-merge (squash) August 27, 2025 22:25
@andrewjstone andrewjstone merged commit d4df3f7 into main Aug 27, 2025
16 checks passed
@andrewjstone andrewjstone deleted the tq-alarms branch August 27, 2025 23:09
andrewjstone added a commit that referenced this pull request Aug 28, 2025
This PR builds on #8753 

This is a hefty PR, but it's not as bad as it looks. Over 4k lines of it
are in the example log file in the second commit. There's also some
moved and unmodified code that I'll point out.

This PR introduces a new test tool for the trust-quorum protocol:
tqdb. tqdb is a repl that takes event traces produced by the `cluster`
proptest and uses them for deterministic replay of actions against test
state.

The test state includes a "universe" of real protocol nodes, a fake
nexus, and fake networks. The proptest and debugging state is shared and
contained in the `trust-quorum-test-utils`.

The debugger allows a variety of functionality including stepping
through individual events, setting breakpoints, snapshotting and diffing
states and viewing the event log itself.

The purpose of tqdb is twofold:

  1. Allow for debugging of failed proptests. This is non-trivial in
     some cases, even with shrunken tests, because the generated
     actions are high-level and are all generated up front. The actual
     operations such as reconfigurations are derived from these
     high-level random generations in conjunction with the current
     state of the system. Therefore the set of failing generated
     actions doesn't really tell you much. You have to look at the
     logs and the assertion that fired, and reason about it with
     incomplete information. Now, for each concrete action taken, we
     record the event in a log. In the case of a failure, an event log
     can be loaded into tqdb, with a breakpoint set right before the
     failure. A snapshot of the state can be taken, and then the
     failing event can be applied. The diff will tell you what changed
     and allow you to inspect the actual state of the system. Full
     visibility into your failure is now possible.

  2. The trust quorum protocol is non-trivial. tqdb allows developers
     to see in detail how the protocol behaves and understand what is
     happening in certain situations. Event logs can be created by hand
     (or script) for particularly interesting scenarios and then run
     through tqdb.
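
The replay workflow in item 1 (run to a breakpoint, snapshot, apply the failing event, diff) can be sketched roughly as follows. This is a toy model under loud assumptions: `Event`, `TestState`, `Replayer`, and `diff` are hypothetical names, and the state is a flat key/value map rather than the real universe of nodes, fake Nexus, and fake networks.

```rust
use std::collections::BTreeMap;

/// Hypothetical concrete event recorded in the log.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Set { key: String, value: u64 },
    Remove { key: String },
}

/// Hypothetical test state; a flat map stands in for the real state.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct TestState {
    pub values: BTreeMap<String, u64>,
}

impl TestState {
    /// Applying an event is deterministic, so replay is reproducible.
    pub fn apply(&mut self, event: &Event) {
        match event {
            Event::Set { key, value } => {
                self.values.insert(key.clone(), *value);
            }
            Event::Remove { key } => {
                self.values.remove(key);
            }
        }
    }
}

/// Replays an event log against the state, one event at a time.
pub struct Replayer {
    log: Vec<Event>,
    next: usize,
    pub state: TestState,
}

impl Replayer {
    pub fn new(log: Vec<Event>) -> Self {
        Self { log, next: 0, state: TestState::default() }
    }

    /// Apply the next event; returns false once the log is exhausted.
    pub fn step(&mut self) -> bool {
        match self.log.get(self.next) {
            Some(event) => {
                self.state.apply(event);
                self.next += 1;
                true
            }
            None => false,
        }
    }

    /// Run until the event at index `breakpoint` is the next to apply.
    pub fn run_to(&mut self, breakpoint: usize) {
        while self.next < breakpoint && self.step() {}
    }
}

/// Describe what changed between a snapshot and the current state.
pub fn diff(before: &TestState, after: &TestState) -> Vec<String> {
    let mut out = Vec::new();
    for (k, v) in &before.values {
        match after.values.get(k) {
            None => out.push(format!("- {}: {}", k, v)),
            Some(new) if new != v => out.push(format!("~ {}: {} -> {}", k, v, new)),
            Some(_) => {}
        }
    }
    for (k, v) in &after.values {
        if !before.values.contains_key(k) {
            out.push(format!("+ {}: {}", k, v));
        }
    }
    out
}
```

In this model a snapshot is just `replayer.state.clone()` taken after `run_to`, and the diff is computed after one more `step`.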

In order to get the diff functionality to work as I wanted, I had to
implement `Eq` for types that implemented `subtle::ConstantTimeEq` in
both the `gfss` (our secret sharing library) and `trust-quorum` crates.
However, it is unknown whether the compiler could break the
constant-time guarantees through these implementations. Therefore, a
feature flag was added such that only the `test-utils` and `tqdb` crates
are able to use them. They are not used in the production codebase.
Feature unification is not at play here because neither `test-utils`
nor `tqdb` is part of the product.